The dataset is derived from emergency department of a hospital and consists of following information:
#Load libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime
import numpy as np
This section of the notebook performs exploration of the dataset. Since the data represents functioning of hospital emergency department, questions have been grouped in three categories:
# Load Dataset
filename = 'Generic ED 2009.xlsx'
dataset = pd.read_excel(filename)
dataset.head()
dataset.columns
# Checking for null values
dataset.isnull().sum()
dataset.dtypes
from pandas_profiling import ProfileReport
ProfileReport(dataset, title="Pandas Profiling Report")
It is seen there are some null values in a few columns. Following cells fill up null values as explained below:
dataset.loc[dataset['TimeDiff TreatDrNr-Act. Depart (mins)'].isnull(),'TimeDiff TreatDrNr-Act. Depart (mins)'] = dataset.loc[dataset['TimeDiff TreatDrNr-Act. Depart (mins)'].isnull(),'TimeDiff Arrival-Actual Depart (mins)']/2
#check for most diagnosis desc
diag_desc = dataset[dataset['Presenting Complaint Code'].isnull()].loc[:,'Diagnosis Desc.'].unique().tolist()
diag_desc
diag_desc = dataset[dataset['Presenting Complaint Code'].isnull()].loc[:,'Diagnosis Desc.'].unique().tolist()
#diag_desc
for diag in diag_desc:
df_red = dataset[dataset['Diagnosis Desc.']==diag].groupby(['Presenting Complaint Desc.','Presenting Complaint Code'])['MRN'].count().reset_index()
max_count = df_red.MRN.max()
#print(df_red[df_red.MRN==max_count].iloc[0,0])
val = df_red[df_red.MRN==max_count].iloc[0,0]
val1 = df_red[df_red.MRN==max_count].iloc[0,1]
dataset.loc[np.array(dataset['Presenting Complaint Code'].isnull()) & np.array(dataset['Diagnosis Desc.']==diag),'Presenting Complaint Desc.']= val
dataset.loc[np.array(dataset['Presenting Complaint Code'].isnull()) & np.array(dataset['Diagnosis Desc.']==diag),'Presenting Complaint Code']= val1
#Checking combination of diagnosis code and diagnosis description together
dataset.groupby(by=['Diag Code', 'Diagnosis Desc.'])['Diag Code'].count()
# Correct spelling errors
dataset.loc[dataset['Diagnosis Desc.']=='FOOD POISOINING','Diagnosis Desc.']='FOOD POISONING'
sns.distplot(dataset['TimeDiff Arrival-Actual Depart (mins)'])
dataset['TimeDiff Arrival-Actual Depart (mins)'].describe()
dataset['Diagnosis Desc.'].value_counts()
plt.figure(figsize=(25,200))
sns.set(font_scale=1.5)
sns.boxplot(x="TimeDiff Arrival-Actual Depart (mins)",y='Diagnosis Desc.', data=dataset,orient="h")
plt.figure(figsize=(15,8))
sns.boxplot(x = ' Age (yrs)', y= "TimeDiff Arrival-Actual Depart (mins)",data=dataset)
plt.figure(figsize=(15,8))
sns.boxplot(x = 'Triage Priority', y= "TimeDiff Arrival-Actual Depart (mins)",data=dataset)
plt.figure(figsize=(10,10))
sns.scatterplot(x = 'TimeDiff TreatDrNr-Act. Depart (mins)', y= "TimeDiff Arrival-Actual Depart (mins)",data=dataset)
dataset.loc[:,['TimeDiff TreatDrNr-Act. Depart (mins)', "TimeDiff Arrival-Actual Depart (mins)"]].corr()
dataset['Presenting Complaint Desc.'].value_counts()
plt.figure(figsize=(25,200))
sns.set(font_scale=1.5)
sns.boxplot(x="TimeDiff Arrival-Actual Depart (mins)",y='Presenting Complaint Desc.', data=dataset,orient="h")
pat_date = dataset['Arrival Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
pat_date.value_counts(sort=False).reset_index()
pat_date = pat_date.value_counts(sort=False).reset_index()
pat_date.columns = ['Arrival Date','patient_count']
pat_date['Arrival Date'] = pd.to_datetime(pat_date['Arrival Date'],format = '%Y-%m-%d')
pat_date.sort_values(by=['Arrival Date'], inplace=True)
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111)
plt.plot(pat_date['Arrival Date'],pat_date['patient_count'])
ax.set_xlabel('Date')
ax.set_ylabel('Number of Patients per day')
Efficient running of emergency department needs planning of resources, availability of adequate hospital staff etc. This can be facilitated if an estimate is available on the total time a patient is expected to spend in the hospital after arrival. Exploratory Data Analysis for this dataset showed dependency of total treatment time on many factors like diagonsis, age of patient, complaint description etc.
In addition to above, planning of hospital operation can be further improved if total demand in terms of number of patients per day can be forecasted using historical data. It is seen that there is a periodic dependency in the number of patients per day with time of the year.
Using these insights, it is proposed to devise the following two pipelines :
dataset['Diagnosis Desc.'].value_counts()
dataset['MRN'].value_counts()
dataset['Presenting Complaint Desc.'].value_counts()
dataset['Departure Status Desc.'].value_counts()
dataset['Depart Status Code'].value_counts()
dataset['Depart. Dest. Desc.'].value_counts()
df = dataset[dataset['Diagnosis Desc.']=='DID NOT WAIT']
df.head()
sns.distplot(df['TimeDiff TreatDrNr-Act. Depart (mins)'])
df['Presenting Complaint Desc.'].value_counts()
df1=df['Presenting Complaint Desc.'].value_counts()
sns.distplot(df[' Age (yrs)'])
plt.figure(figsize=(20,20))
sns.barplot(df1.values,df1.index)
import sweetviz as sv
my_report = sv.analyze(dataset, target_feat='TimeDiff Arrival-Actual Depart (mins)')
my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"